Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing
Abstract
We describe a “direct modeling” approach to using prosody in various speech technology tasks. The approach does not involve any hand-labeling or modeling of prosodic events such as pitch accents or boundary tones. Instead, prosodic features are extracted directly from the speech signal and from the output of an automatic speech recognizer. Machine learning techniques then determine a prosodic model, which is integrated with lexical and other information to predict the target classes of interest. We discuss task-specific modeling and results for a line of research covering four general application areas: (1) structural tagging (finding sentence boundaries and disfluencies), (2) pragmatic and paralinguistic tagging (classifying dialog acts, emotion, and “hot spots”), (3) speaker recognition, and (4) word recognition itself. To give an idea of performance on real-world data, we focus on spontaneous (rather than read or acted) speech from a variety of contexts, including human-human telephone conversations, game-playing, human-computer dialog, and multi-party meetings.
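The abstract does not say how the prosodic model is integrated with lexical information. As a rough illustration only, the sketch below assumes one common option: log-linear interpolation of two class-posterior estimates, followed by renormalization. The function name `combine_posteriors`, the `weight` parameter, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def combine_posteriors(prosodic, lexical, weight=0.5):
    """Log-linearly interpolate two class-posterior dicts and renormalize.

    Assumption (not from the paper): both models emit strictly positive
    posteriors over the same classes; `weight` is the share given to the
    prosodic model and would normally be tuned on held-out data.
    """
    classes = prosodic.keys() & lexical.keys()
    scores = {
        c: weight * math.log(prosodic[c]) + (1.0 - weight) * math.log(lexical[c])
        for c in classes
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / total for c, s in scores.items()}


# Hypothetical posteriors for a single word boundary (sentence-boundary task).
prosodic = {"boundary": 0.7, "no_boundary": 0.3}
lexical = {"boundary": 0.4, "no_boundary": 0.6}

combined = combine_posteriors(prosodic, lexical, weight=0.5)
predicted = max(combined, key=combined.get)
```

With equal weights the prosodic evidence tips this made-up example toward `boundary`; setting `weight=1.0` reduces the result to the prosodic model alone.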
Similar resources
Prosody Modeling for Automatic Speech Understanding: An Overview of Recent Research at SRI
Prosody has long been studied as an important knowledge source for speech understanding. In recent years there has been a large amount of computational work aimed at prosodic modeling for automatic speech recognition and understanding. Whereas most current approaches to speech processing model only the words, prosody provides an additional knowledge source that is inherent in, and exclusive to,...
A general-purpose 32 ms prosodic vector for hidden Markov modeling
Prosody plays a central role in conversation, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representa...
Automatic labeling of Japanese prosody using J-ToBI style description
Speech corpora with prosodic labels are getting more and more important not only for speech synthesis but also for discourse modeling. A widely used labeling system for Japanese prosody, J-ToBI, however, is insufficient for applications like discourse modeling and it even lacks an accurate method for automatic labeling. In this paper, we propose an automatic labeling method for J-ToBI style des...
Prosody Modeling for Automatic Speech Recognition and Understanding
This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automati...
Prosodic models, automatic speech understanding, and speech synthesis: towards the common ground
Automatic speech understanding and speech synthesis, two of the major speech processing applications, impose strikingly different constraints and requirements on prosodic models. The prevalent models of prosody and intonation fail to offer a unified solution to these conflicting constraints. As a consequence, prosodic models have been applied only occasionally in end-to-end automatic speech unde...